feat: add cosine_distance scalar function#21542
Conversation
Add cosine_distance (and list_cosine_distance alias) to compute cosine distance between two numeric arrays. Includes shared vector math primitives in vector_math.rs for reuse by follow-on functions. Part of apache#21536. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses review comments on apache#21542: - Iterate list offsets/values directly instead of per-row ArrayRef downcast - Remove nested-list unwrap loop (function does not support nested lists) - Drop convert_to_f64_array wrapper (coerce_types already guarantees Float64) - Remove duplicate Rust unit tests now covered by SLT - More descriptive error message for mismatched list lengths - Delete now-unused vector_math module; inline math into sole caller Adds SLT coverage for NULL-element-in-list behavior previously tested only in Rust unit tests. Part of apache#21536. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the detailed review, @Jefffrey. Rework pushed in fc3ee90. Walking through each comment: 1. Iterate via offsets/values, not per-row ArrayRef downcast ( 2. Nested-list unwrap loop ( 3. Redundant null/Float64 check ( 4. Ambiguous length-mismatch error wording ( 5. Duplicate Rust unit tests ( 6. 7. Inline the math instead of a separate module ( Full validation matrix ( |
Addresses round-2 review comments on apache#21542: - Widen container variant in coerce_types when inputs mix List and LargeList (or FixedSizeList), so mixed-type calls like `cosine_distance([1.0, 0.0], arrow_cast([0.0, 1.0], 'LargeList(Float64)'))` succeed. Follows the pattern from PR apache#21704 (ArrayConcat). - Coerce bare NULL inputs to a matching list variant so `cosine_distance(NULL, [1.0, 2.0])` returns NULL instead of erroring. - Drop the `list_cosine_distance` alias — the base name is not `array_cosine_distance`, so the `array_X` -> `list_X` convention does not apply. - Expand SLT coverage: mixed-type variants, FixedSizeList inputs, Float32 and Int64 inner types, bare NULL in each position, NULL row in a multi-row VALUES, and an unsupported-type plan error case. Dispatch fallthrough in cosine_distance_inner is now unreachable after the coerce_types widening, changed from exec_err! to internal_err!. Part of apache#21536. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks @Jefffrey. Round-3 pushed in ce312cc: 1. Mixed-type inputs ( 2. 3. Bare NULL input ( 4. Multi-row NULL coverage ( Additional SLT coverage added proactively:
Full validation run clean (fmt, clippy, full + sqlite-extended SLT, CLI, doctests, feature-flag checks, |
Summary
cosine_distance(array1, array2)/list_cosine_distance— computes cosine distance (1 - cosine similarity) between two numeric arraysvector_math.rsprimitives (dot_product_f64,magnitude_f64,convert_to_f64_array) for reuse by follow-on vector functionsPart of #21536 — first in a series of split PRs (replacing #21371).
Test plan
cosine_distance.sltcovering all edge cases including empty arrays, LargeList, integer coercion, alias, return typecargo clippy,cargo fmt,taplo,prettier,cargo machete— all clean🤖 Generated with Claude Code